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Analysis on Non-negative Factorizations and Applications 

Yat Tin Chow * Kazufumi Ito^ Jun Zou* 


Abstract 

In this work we perform some mathematical analysis on non-negative matrix factorizations (NMF) and 
apply NMF to some imaging and inverse problems. We will propose a sparse low-rank approximation of big 
positive data and images in terms of tensor products of positive vectors, and investigate its effectiveness in 
terms of the number of tensor products to be used in the approximation. A new concept of multi-level analysis 
(MLA) framework is also suggested to extract major components in the matrix representing structures of 
different resolutions, but still preserving the positivity of the basis and sparsity of the approximation. We 
will also propose a semi-smooth Newton method based on primal-dual active sets for the non-negative 
factorization. Numerical results are given to demonstrate the effectiveness of the proposed method to 
capture features in images and structures of inverse problems under no a-priori assumption on the data 
structure, as well as to provide a sparse low-rank representation of the data. 

Mathematics Subject Classification (MSC2000): 15A23, 65F22, 65F30, 65F50, 78M25. 
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1 Introduction to Non-negative Factorizations 

Non-negative factorization (NMF) has attracted a great deal of attention in the last decade, in an attempt 
to tackle k-clustering problems and structural analysis of big data. It is very effective in extraction of principle 
components, features and similarities inside a large set of data or image. NMF was studied as early as in 
1994 21], and used for machine learning and data mining 1151 jji . The concept of NMF as k-means clustering 
for principle component analysis has been widely studied theoretically and numerically in literature, see, e.g., 
[111110 EB US HU HU; and the concept of tri-factorization was used as a concurrent column and row clustering 
(a.k.a. co-clustering) of data in [8). In order to extract desired features as well as to reduce memory complexity, 
sparsity is often imposed in NMF using Iq or U regularization. Effective NMF toolboxes have been also developed 
to provide different choices of regularizers and constraints, e.g., the non-negative matrix factorization toolbox 
in MATLAB [TSj. A convex model for NMF was suggested in 1, where the convex Zi i00 -norm is used as 
the regularizer to enforce row sparsity. In an application of this convex model to hyper-spectral end-members 
selections, the NMF succeeded to provide abundance maps of end-members representing different structures 
inside an image, e.g., roofs, trees, grass, soil and road. 

In general, an NMF of a given matrix Y € R ArxM is of the form 

Y ss AP , A e R Nxk , P g R fcxM , (1.1) 

where matrix P is non-negative component-wise. In most applications, we may require dimension k to be much 
smaller than the dimension of Y, i.e. k « min(V, M). P is regarded as a basis of the information contained 
in matrix Y. We may further impose P to be nearly orthornormal, i.e., PP T ss I. In this case, it is similar 
to a partition of unity in the underlying space and the vectors in P are similar to some indicator functions. In 
order to reduce memory complexity in storing the basis P, one may further add a sparsity constraint on P. 
The matrix A is an assignment matrix, which gives some special weighting to the corresponding vectors in P. 
It is our aim to obtain a sparse matrix A which has a very small number of non-zero entries. Therefore, A 
can be interpreted as some sparse assignments of linear combinations of basis vectors in P. If matrix Y is also 
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non-negative component-wise, we may further require A > 0 component-wise. This constraint may be infeasible 
if Y is not nonnegative, and in this case we shall relax and drop the non-negativity condition for A. The sparsity 
constraint on A ensures more concise information extraction. Moreover, we may also have a post-process to sort 
vectors of A in descending order in terms of magnitude, which can yield the most important bases of matrix P. 
Using a standard l± regularization to impose sparsity for A and P and near-orthogonality for P, the problem 
of NMF for a non-negative matrix Y can be reformulated as the following minimization problem: 

A> min >o ||F - AP \\ 2 Fi2 + a\\A\\ Ftl + u\\P\\ F ,i + l\\PP T - I\\f,i (1-2) 

over nonnegative matrices A £ M. Nxk and P £ R fcxM , where ||X||p t 2 := l-Yqj 2 ' s the Frobenius norm, 

||X||i?i := Y^ij \*ij\ an d a i v il are regularization parameters. 

A natural approach for matrix factorization is the singular value decomposition (SVD), which helps obtain 
the best low-rank approximation of a matrix in l 2 sense and extracts the most important components of the 
matrix based on the magnitude of their corresponding singular values. The factorization of SVD is of the form 

Y = UYV t (1.3) 


where we can interpret the matrices U, V as bases of information, £ as a weighting representing the importance 
of the corresponding basis vectors in U and V. Although this approach gives the best low-rank approximation of 
matrix Y in l 2 norm after a truncation of £, the SVD factorization is unstructured and usually does not respect 
positivity, often with the basis vectors of U and V being rather oscillatory. Especially for a matrix Y which 
represents an image or a probability density function, such an SVD factorization does not give us much useful 
information of the underlying structures that Y represents, e.g., identifying regions of high probability, locating 
objects inside the image, etc. Therefore we shall turn to NMF to obtain a more structural decomposition of 
the matrix that shall respect more the positivity of the basis. Now, combining the non-negativity constraints 
and the SVD gives rise to the idea of non-negative matrix tri-factorization, which was studied in [5]. In this 
work, we suggest and investigate the following version of non-negative matrix tri-factorization for non-negative 
matrix Y using l\ regularisation: 


min 

u> 0,E>0,V>0 


Y-UYV t \\% 2 


aPIki 


v\\U\\ F ,i + v\\V\\ f , 1 + 1 \\UU t - I Ik,! +7ll W T - (1-4) 


Similarly, we may interpret the matrices U £ Afjv/xp? U £ M px n as basis of information, £ £ M pxp as a 
generalized singular matrix. We emphasize that the matrix £ is not required to be diagonal in our setting here, 
but to be sparse only. 

We shall propose the application of the aforementioned model of non-negative matrix tri-factorization to 
big data and large images to extract their major components, which may represent some special structures or 
features, and obtain an approximation of the data with low memory complexity when the rank p is small, even 
when the original data and images do not attain any sparsity structure. This shall be quite effective, considering 
the fact that the factorization gives a low rank sparse approximation of the matrix in term of the tensor products 
of column and row vectors of U and V. The fact that p is small requires the storage of only a few columns and 
rows in the matrices U and V, therefore greatly reduces memory complexity. The sparsity of £ is also very 
important for the reduction of memory complexity because we only need to store the respective columns and 
rows of the matrices U and V, e.g., rq and Vj, where the corresponding entry a, :j in the singular matrix £ is 
significant. The sparsity of U and V are equally important because iq and Vj will then have a few number of 
non-zero entries and are inexpensive to store. These reasons suggest us to apply the above NMF model to big 
data and imaging. To effectively implement the NMF, we utilize the well-known semi-smooth Newton method 
based on primal-dual active sets [12] for the optimization process. It may be more advantageous than some 
classical methods .6 8, Using the result of NMF from the Newton method, we propose a dissection of the 

image into levels by its order of importance. 

We then proceed to propose a new concept of multi-level analysis (MLA) framework of the images based 
on the NMF, which aims to extract major components inside the matrix Y representing structures of different 
resolutions and obtain sparse low-rank approximations of different levels with positive basis. For each ine level, 
we hope to extract and represent features of up to a finer resolution with sparse approximation by positive basis. 
Our MLA framework is partially motivated by, though different from, the multi-resolution analysis (MRA) in 
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wavelet analysis, e.g. in [5]. The MRA framework is well-established to provide successive approximations of 
increasing resolutions of a function by a shifting and scaling of a mother wavelet. However, it has the property 
that the basis functions generated from the mother wavelet always do not have the same sign of the whole 
space. This is a very undesirable feature in our context. Hence, we introduce a new MLA framework, which 
shall respect the positivity of the basis for function/matrix approximation, and on the other hand provide a 
similar multi-resolution property as in MRA. In our MLA framework, we introduce a nested sequence of linear 
spaces H s each of which represents a level of fineness, and define interpolation operators among these spaces of 
coarser and finer levels. The NMF is then performed on each level to obtain a positive sparse approximation. We 
would like to emphasize that the main purpose of either our NMF model or the newly proposed MLA framework 
is only to identify and represent structures (of different scales) in the images or big data, and we are neither 
hoping to reconstruct the data in full entity nor aiming at very high-quality compression of image to defeat 
any available well-developed compression techniques, e.g. wavelet/curvelet compression, JPEG etc. Numerical 
experiments show acceptable resolution of images and data can be achieved by this sparse approximation using 
the MLA framework of the NMF model, as well as extracting the major features and components in the images 
and data without any a-priori assumption of their structures, such as sparsity and specific patterns. 

This paper is organized as follows. In section [2] the general mathematical framework of non-negative matrix 
tri-factorization using li regularization is clearly stated, and an optimal choice of the dimension of generalized 
singular matrix is investigated. An MLA framework using NMF is introduced in section [3] and a semi-smooth 
Newton method based on primal-dual active sets for NMF is presented in section [4j Applications of our 
framework to imaging and inverse problems are provided in section [5j providing numerical evidence for some 
successful feature extractions and sparse low-rank representation of the data. 

2 A non-negative matrix tri-factorization using l\ regularization 

In this section we shall clearly state the type of matrix tri-factorizations for our subsequent consideration. 
For the purpose, we often write Mmxn for the set of M x IV matrices and M^ xp C Mmxn for those with 
positive entries. Given a matrix Y £ M^ IxN , we define a functional : M^ xp x M pxp x M pxN —x R for 

a fixed set of parameters p, a, 7 : 

J^(U,Y,V) := \\Y-UYV T \\ 2 Ft 2 +'y\\Y \\ F>1 + u\\U\\ Ftl +u\\V\\ Ftl +a\\UU T -I\\ F , 1 +a\\VV T -I\\ F>1 . (2.1) 
Let [Up, S p , V p ] be a minimizer of the functional, then we define an operator I p ' v ' lm . M^ xN —X M^ xN by 

n := U p YpVp = ^a ij (u p ) i ®(v p ) j (2.2) 

where (u p )i, ( v p )j denote the column and row vectors of U p and V p respectively and ay,- is the th entry 
of the matrix E p . This non-negative matrix tri-factorization can be regarded as a non-negative version of the 
SVD, with matrix E p being the generalized singular matrix, which is not restricted to be diagonal as in the 
standard SVD. 

It is easy to note that with a smaller p, the memory of storing the matrix triple [U p , E p , V p ] is less. If 
E p is a sparse matrix, the memory complexity can be further reduced, as we only need to store the vectors 
(u p )i and {v p )j when ay,- is non-zero. In fact, for a generic matrix Y , if p can be chosen to be small and 
yet ||F — l p ' l ' , ' 1 (Y)\\p 2 can still be maintained to be a small quantity, then [U p ,T, p ,V p ] may serve as our 
desired sparse low-rank approximation of Y. However, it is obvious that the smaller the value of p is, the 
worse the approximation of Y by I p ’ u, ' 1 (Y) will be. With a smaller p, the error ||Y — X p ’ vn (Y)\\ 2 F and also 
Jp'”’ 1 (U p ,Y, p ,Vp), will be larger. Therefore, in practice, it is an interesting question to ask how we should 
choose the number p as N, M grow large. 

2.1 An Optimal choice of p 

In what follows, we aim to find an optimal choice of p with respect to N, M by means of a probabilistic 
argument. We first obtain a lower bound in terms of p,N,M,5 of the probability that there exists a triple 
[U, E, V] such that J p ' v,1 {U, E, V) < S. From this lower bound, we suggest an optimal choice of p to maximize 
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this probability. The value J p ' vn (U p , E p , V p ) reflects the derivations of matrices U p , V p from being orthogonal, 
the sparsity of and the error of the approximation of Y by J“’"’ 7 (y). In particular, if for some 

[U, E, U], we have J.(U, E, V ) < 6, then 






P ,Y P ,V P )<J^(U,E,V)<6. 


We begin by showing the following lemmas concerning a set of i.i.d. random vectors. Consider a set of 
i.i.d random vectors {Xi}f =1 G [ 0 , l] d , where the probability distribution dVx = fdx with dx denoting the 
standard Lebesgue measure and 0 <C'i</<C , 2 <oo. Then it is direct to see that the random variables 
{w.j := Xi/\\Xi\\ 2 }f =l € S d_1 has a probability density dP u = gdu; where du; is the standard surface measure 
and jjd=jj- < g < |.^j| for some other constants 0 < 61,62 < 00 . From this, we can derive the following 
important results for our subsequent analysis. 

Lemma 2 . 1 . Consider a set of i.i.d random vectors {Xi}fL 1 € [0, l] d , where the probability distribution dPx = 
fdx with dx denoting the standard Lebesgue measure and 0 < 61 < / < 62 < 00 . Then the probability of the 
vectors Ui := Xj/||X ;||2 that can be approximated by p points {Pj }^ =1 € S 1 ^ 1 |J[0, l] d within an error of small 
e > 0 can be estimated by 


p N (C 3 e)^ N < P 3{PJ ? =1 s.t. {wijili C |J B e (P t ) < p N (C 4 e)^ N (2.3) 

\ 1 <i<P J 

for two positive constants 63 and 64 . 

Proof. Using the fact that for small e > 0, Ce < sine < e for some 6 > 0, we can actually observe from the 
assumption of the i.i.d. random vectors and the binomial theorem that 


> 


> 


3{P.i}U s.t. {c 0 i}? =1 C |J B e (P t ) 


l<i<P 


E 


TV! 


N i= N 


=N IliW |s d 1 f|[°- 1 ] d l i . 1 n[o,i] d 


n 


p(ii, 


E 


TV! 


ILW 




Ni 


ELi Ni=N 

N (C 3 e)^ N 


P 


K || 2 < s) Ni dK 


for some 63 > 0. The other estimate is similar. □ 

Lemma 2 . 2 . Consider a set of i.i.d random vectors {Pj }^ =1 € [0, l] d , where the probability distribution dP u = 
fdu with duj denoting the standard surface measure and 0 < 61 < / < 62 < 00 . Then for p < d, the probability 
of the set of vectors Pi being almost mutually orthogonal within an error of small e > 0 can be estimated by 

p\d{C 3 e)^^ 1+ ^ d -^ <P{\{Pi,Pj) -6ij\ < eVz,j) <p!d(6 4 3e) Mi ^ 11 +( d - 1 ) (2.4) 

for two positive constants C 3 and 64. 

Proof. By direct counting, the fact that \\Pi — Pj\\ 2 = 2 — 2(Pi, Pf) and along with the half angle formula, we 
have for p < d that 


P((l Pi,Pj) - < eVi,j) 

> p\d(c 3 e) d ~ i 

l<i<p 

> P !d( 6 3 e) Mi F ii+ ( d ~ 1 ) 

for some 63 > 0. The other estimate is similar. □ 
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Lemma 2 . 3 . Consider a set of i.i.d random vectors {Xi]f =1 £ [0, l] d , where the probability distribution dVx = 
fdx with dx denoting the standard Lebesgue measure and 0 < C\ < f < C 2 < 00 . Then for p < N, the probability 
of the event E p , e representing the existence of {Pi] p i=l such that C Ui<i<p B s (Pf) and \(Pi, Pj) — Sij\ < 

e\/i,j for a small e > 0 can be estimated by 

(p N - (p- 1 ) N ) ^d(C 3 e)“ +(d - 1)(JV+1) < ¥(E p , e \E P -i,e) < (; P N (p-l) N ) 


for two positive constants C 3 and C 4 , and therefore 


Y (l N ~ {l- 1)^) i! d (C 3 e)^ 2 ^ +(d_1)(Ar+1) < P (E Pie ) < Y ( lN 
1 = 1 1=1 


(^l) 7 V )l!d(C' 4 £)^ +(d - 1)(iV+1) . 


Moreover, we have the following lower bound estimate 

P {E p>e ) > dp N (C 3 eY d - 1 ^ N+ ^ +m ^ 1 


(2.5) 


Proof. The following inequality follows directly from the argument of the above two lemmas 


E 

J2 P i =1 N i=N , Ni>0 

< P (Ep tE \E p -i }e ) 


TV! 


< 


E 


rw 

TV! 

TW 


pidiCzey-^+^-^+v 


p!d(C 4£ ) E ^ +(d - 1 » iV+1) 


Ni=N , Ni>0 

Now since the last term can be simplified as follows: 

p!d(C 3£ ) E ^+( d - 1 )( iV+1 ) Y, 

Y. P i=i Ni=N,Ni>0 

TV! 


TV! 


= p!d(C 3 e) P -^ + ( d -^ N+ V ( £ 

N i= N 


n,w 

- E 


TV! 


ilw ..UiNii 


Y. p Zf Ni=N ' 


= (p N - (p- l) N )p\d(C 3 s) si ^ 1 +^- 1 ^ N+ ^ 


we directly have 


Y ( lN - (l - 1)^) l'- d (C 3 £) ii V Li+(d - 1)(W+1) < P (E Pl e) < Y ( l N 
1=1 1=1 


(l-l) N )l\d(C i e) ,Ji ^^ d - 1 ^ N+1 K 


The last inequality comes readily from 

P (E p , e ) >Y(l N -(l- if) diC.e) 1 ^^- 1 ^^ = dp N (C 3 ef-^ N+ ^ +M ^ . 
1 —1 


□ 

Now we consider a general image or large data Y = JT . Y,j e 4 ® e ; comprised of non-negative entries. 
Without loss of generality, we may assume max,^ \Yij\ = 1. If we write 1) := JN Yy e^, and w* = Yi/||Yj|| 2 , then 
Y = Z)j ||Yi || 2 e* <g> uii. If there exists a set of {P »}? =1 such that C Ui<i<p Be (Pi) and | (Pi, Pj) ~ $ij\ < 

e \/i,j, we can write G B e (Pj) for some Kj with 1 < j < p. Then intuitively, we have 

p K j 

I = E Utilize* «EE ll y fcj|2% ® Pj • 

i j—1fcj=l 
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Writing Qj := CEk -=l \\ Y kj \\-2e kj ) / \jT,k-=i W^b and denoting oy,- = \JY%~i I l y % lb, then 

I ~ ^ ( Oy^' Qj CT Pj 

where |(Pi,Pj) - <%| < £ and |(Qi,Qj) - \ = 0 for any i,j. By setting S = (cry), P = (P) T , Q = {Qj), we 

derive directly that 


k. 


hence 


Qi® P j\\F 2 <EE ~Pj\ < \\I\\ F , 2 S<NMe, 

i j =1 /cj — 1 


J"’^(U P ,£ P ,V P ) < J?™(Q,E,P) 


< I|/Ik 2 £ + 7E, 


K, 


Y n y fei lb + ill'll 1 HPilli + otp{p - 1 )e 

\ fcj=l J ® 


< NMe + NM^2v) + ap(p — l)e. 

The probability of the event E p , e such that the above estimate holds can be bounded below by 

V 

[r -(i- 1 y-)i\M ((j 3 £; IJtzll ~ 

1=1 


'(Pp, E ) > Y ( lN - 1)^) l'-M(C 3 e)^~ +( m - i )( jv + i ) 

1=1 

> Mp N (C 3 e)( M - 1 K N+ V +m %= 11 . 


Similarly, switching the columns and rows of the image, we may follow the above argument and analysis to 
conclude the same with TV, M swapped. Combining the above two statements, we come to 


> 


Jp’^iUp, £ P , V P ) < NMe + TVM ( 7 + 2v) +ap(p- l)e) 
Y (r ax(Ar ’ M) — (Z — l) max bV M )) /! min(TV,M) (C 3 e)^ n ' m ' 1) 


1 = 1 


> min (TV, M)p max ( JV,M ) {C 3 e)^ n ' m ^ , 
where the function fi( ■, •, •) is defined for all TV, M, l £ N by 

/x(TV, M,Z) := l<yl ~ ^ + MTV — |TV — M| — 1. 


( 2 . 6 ) 


If we further choose the parameter 7 + 2v < (K — l)s for some K > 1, then we deduce the following lemma. 
Lemma 2 . 4 . For any small e > 0 and for all N,M £ N, it holds 

P ( J^iUp , Ep, < (KNM + min (TV, M ) 2 ) e) 

E (/ max(JV ’ M) — (Z — i) max W. M )j ji min (7V,M) (C^sf 


> 


;=i 


> min(TV, M)p ma ^ N ' M '> (C 3 e)^ N ’ m ’ p '> 


(2.7) 

( 2 . 8 ) 


where the function p( ■, •, •) is defined as in (2.6) and 7 is such that 7 + 2^ < (A' — l)e /or some AT > 1. 

Before we derive a sharp bound of an optimal choice for p , let us consider a rough lower bound introduced 
in the last inequality (2.8). Clearly if we consider the function 

F(p) := min (TV, M)p max < iv ’ M ) (C 3 e)^ NM ’ p) 
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for p > 1 , then it is easy to see 


log(C 3 e)| 3 2 


F'{p) = —M fmax(N,M) + 1 ' - | log(C 3 e)|(p - j) 

p \ 16 4 


= > 0 . 


namely 


f<l 

1 3 

rr 



— 

1>J 

I 4 ^ 

1 16 


Therefore we can propose a primitive optimal choice of p to maximize the lower bound of the possibility 
P ( J p a ’^([U P , V p ]) < (KNM + min (M, TV) 2 ) e ^ , i.e. to choose 


p = 


I ma x(M, TV) 


I log(C 3 e)| 

for large TV, M. Following some basic substitutions, we obtain the following theorem. 
Theorem 2.5. For any small 6 > 0, we have 

P Lin J p a ’^(U p ,£ p ,V p ) <§]> (Cfee)* 


(2.9) 


( 2 . 10 ) 


whenever 7 + 2v < (K — 1) e, where s:= 5 (KNM + min(Tbf, IV) 2 ) 1 for some K > 1, the function /i( •, •, •) 
is defined as in ( 2 . 6 ), and pn,m,s stands for the following constant 


Pn,m,s '■= 


I max(M, TV) 


max(M, TV) 


| log(C 3 e) | y \og(KNM + min(M, TV) 2 ) — | log £| — log C 3 
When M = TV, it is obvious that the above optimal choice of p for a fixed S > 0 is of the form 

I TV~ 


( 2 . 11 ) 


TV 


P = PN,N,6 = 


2 log TV — | log S | — log C 3 + log (K + 1) y 2 log TV 
as TV goes to infinity. The last asymptotic relation actually gives a precise approximation and 


( 2 . 12 ) 


TV 


< Pn,n,s < 


I TV 
log TV 


(2.13) 


2 log TV 

if TV is large enough such that TV > C^T) -1 . Hence (2.12) serves as an optimal choice of p for large TV. 


Furthermore, with this choice of p, the memory complexity is asymptotically \J as TV goes to infinity. 
However, we note that the optimal choice of p obtained above is only based on a rough lower bound (2.8). In 


what follows, we deduce a sharper bound by using ( 2.7|. S ince \2.7\ always increases with respect to p, we get 
an optimal choice of p by controlling the increment of 
the ratio of the terms 


2.7) with respect to p. In order to do so, we investigate 


explicitly given by 


a, := (> ax(JV ’ M) _ (l _ l) ma x(JV,M)^j n min (jv, M) {C 3 e)^ n ' m ' 1) 


„ /7 _i_ 1 \max(iV,M) im&x.(N,M) 

°f+l = V + 1 T ] r -\]Q K (C*em + l) 

^ £max(iV,M) _ ]^max(AT,M) 
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From the l’Hospital rule, we can directly see that for a fixed pair of N,M , the above term ai+i/ai —> 0 as 
l —> oo. Therefore, given a small rj < 1, there is always a pn,m.ji.e such that cq+i/az < 77 whenever l > pn,m,ti,e- 
Then for all p > Pn,m,ti,e we have 


> 


( Jp’^iUp , Sp, V p ) < (.KNM + min (N, M) 2 ) e) 

T.M,rj,e 1 

J2 (l™ x(JV ’ M) - (1 - l) max(jv>M) ) d min(iV, M) (C 3 e) KN ' M ’ l) 

(p W ,,. - !)»“<"■">) (p N , M , v ,e)\ min(IV, M) ( C3£ )eWM,p N , Mi „ e ) 


!=1 

1 

1 - 7? 


PN,M,n,e 


,ma x(N,M) 


whenever 7 + 2 z/ < (A' — 1 ) e, and that the increment of p from Pn.m.ti.c onward brings insignificant increment 
to (2.71. Now we aim to find an explicit Pn,m,t>.e in terms of N, M, thus obtaining an optimal choice of p. By 
Holder’s inequality we readily derive 




E max(AT,M)-l /-< 
2—0 V 1 


1 /P ) 1 


E max(AT,M) — 1 /-, i / \ 

i=o U-VP) 

Now if we consider the function 


-pe 


-| log(C 3 e)|(p+l) 


< 


p(p _|_ ]_)max(JV,M)-l 
_ ]^max(iV,M )—1 


-pe 


-| log(C 3 e)|(p+l) 


(2.14) 


G(JV 0 ,p) 


P{P + 1 ) ° -| log(C 3 e)|(p+l) 

(p_ l)JVo-l 


for p > 2 and N 0 >2, then we see 


|-G(JVo,p) = G(iV 0 ,p) (- + - 

op \P P +1 


N 0 -l 

P-1 


I log^e) I 



0 , 


that implies 


P 


< 

> 


Po{N 0 ) 


where po(-^o) is the unique real zero of —| log(C 3 £)|p 3 + p 2 + (| log(C , 3 £)| — 2 N 0 + 2 )p —1 = 0 , which can be 
found explicitly by the Cardano’s formula or the Lagrange’s method. Fixing Nq, we get that G(No,po(Nq)) 
is the global maximum of G(No,p) on (2, 00 ), and from pq(Nq) onward, the function is decreasing. Together 
with the fact that G(_/V 0 , 2 ) = 4 ( 5 / 3 ) Ar ° _ 1 (C 3 e) 4 , we have that G(-/V 0 ,-) _1 : (0, 4 (C' 3 £) 4 ) —► (j>q,oo) is a well- 
defined smooth function and is monotone by the inverse function theorem, and that the implicit function 
g : (l,oo) —l (l,oo) defined by G(Nq, g(No)) = 77 is well-defined and smooth by the implicit function theorem 
as g(No) = [G(_/V 0 , •)]” 1 (77). Moreover 


<§(N 0 ,g(N 0 )) 



( 9 + 1 
\9~ 1 
9 + l\ 

9-l) 



Np-1 
9 + 1 


|log(C 3 £)|s 3 -g 


~ N g °_ 1 - |log(G 3 e)|^ 
g(g + 1 )(g - 1 ) 

2 -(|log(C 3 £)|-2jVo + 2) 5 + l’ 


Now noting that g(N 0 ) > Po(Nq) + 6 > 1 for some 5 > 0 by our choice of domain, we have | log(C 3 e)|p 3 — p 2 — 
(| log(C 3 e)| — 21Vo + 2 )p + 1 > C > 0 for some G, and 0 < g'(N 0 ) < 00 for all N 0 as well as g'(N 0 ) —> 00 as 
Nq —» 00 . Moreover putting these inequalities back into the expression of g ', we have g'(No) -> 0 as iVo -> 00 , 
and that g satisfies the following differential inequality for large N 0 , 


, /p + l\ 2 4 

9 ~ ° S \9 ~ 1 / |log(C 3 e)| “ (g- l)|log(C 3 £)| ' 

Now using the Gronwall-Bellman-Bihari’s inequality, we directly infer that 

g < H~ 1 {H(a(r))) + N 0 ) 


(2.15) 
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for some constant a{rf) depending only on 77 , where the function H is defined as 


*(»):= 1^^!/(.-!)* = 


I log(C' 3 e)|(s — l ) 2 


Koiv) 


4 J 8 

for some K 0 {r]). Therefore the following inequality holds for g and some constants Ki(r/), Ki(rj), K 3 {rj): 


(2.16) 


9 < 


'Ki(r])N 0 - /v 2 ( 77 ) 


|log(C 3 e)| 


K 3 (??) ■ 


Using pn,m,$ defined in (2.11 1 , we can choose pn,m,ti,c such that 


/ma x(N,M) 

P N ,M ,r] ,e = J^v\ 71 - Tr< - 7“ = 1X r)PN,M,S 


(2.17) 


I log(C' 3 e)| 

for some K v depending on 77 , then for all p > PN,M,rj,e > g (max(7V, M)) = [G (max(7V, M), •)] 1 ( 77 ), we have 




■pe 


< Tj . 


Therefore the growth of the probability P (^J p ’ v, ' y (U p ,'E p ,V p ) < (KNM + min(iV, M) 2 ) with respect to p 
becomes insignificant from pn,m,ti,e onward. This gives another optimal choice of p. Surprisingly, we notice 
that pN,M,ri,e ~ Pn,m,s , be., the two choices of p are of the same order. This leads to the following results. 


Theorem 2.6. The following probability bound holds for any small S > 0: 

p 

P (j“’"’ 7 (U p ,S p ,V p ) (l max(N ’ M) -{l- l) max(JV ’ M) ) U min (N,M) {C 3 e) 

1=1 


n(N,M,l) 


(2.18) 


whenever 7 + 2 ^ < (K — 1) e, where £ := 6 (KNM + min(M, N) 2 ) 1 for some K > 1 and the function ■, •, •) 
is defined as in (|2.6[). For a given small constant 77 , the growth of the summation above with respect to p can be 


controlled by 77 when p > K^pn^mp for some K p depending only on 77 , where pn,m,s is defined as ( 2 . 11 ). 


We can easily see that \\Y —I p ' v " l {Y )\\ 2 F2 < 5 if J p ,v, 1 {U p ,fi p ,V p ) < 5. Clearly, in the particular case when 
M = N, the following asymptotic order for p 


I N 
log N 


(2.19) 


is basically an optimal choice of p, and they are equivalent up to a multiplicative constant whenever N > C 3 J , 
Following this optimal choice of p, the memory complexity grows in the order y as N goes to infinity. 

2.2 Effects of magnitudes of entries in generalized singular matrices 

In this subsection, we discuss a further reduction of memory complexity by truncating the generalized 


singular matrix E p = (oy,). We aim to remove the less important components ( u p )i (g) ( v p )j in (2.2) in a way 
that it still serves as a good approximation of the original matrix Y. 

For doing so, we rearrange 07- from the largest value to the smallest one as ai 1 j 1 > cri 2 j 2 > .. > cn p 2 j p2 - 

We then denote 07 = Cupel < 8 > e;, and write E PiP = XT=i as the truncated generalized singular matrix for all 
p < p 2 . The sequence {cq}f = 1 represents the components of E p in descending order by the importance of its 


magnitudes. With the above definition, we then define an operator I p T ' 1 \ Mff IxN 


Mm x n hy 


1^(Y) := U p Y, P:P Vp = 


p 

u hji y^pjn ^ \ v p)3i 

1=1 


<J iljl(Up)i l ® (Vp)j 


( 2 . 20 ) 
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where [U p , E p , V p \ is a minimizer of the functional (2.1) and E P)P is the truncated generalized singular matrix. 


The approximation Y ■ 


P’ P 

' Ip’p’^iY) = UpTjp t pV p is a truncation of the approximation (2.2) of Y up to p. This 


truncated approximation removes the less important components., hence we only need to save the vectors ( u p ) 


pni 


and {v p )j t for 1 < l < p. This further reduces the memory complexity and serves as our desired sparse low-rank 
approximation of Y. 

In what follows, we give a brief analysis for the aforementioned truncated approximation of Y. Indeed, from 
the pigeon-hole principle, we directly obtain that 


j p a ^(Up,£ p , p ,v p ) < J^iUp^Vpj + cwi^ £ 


i=0 

pi 


p — l 


< Jp^^iUp, E p , V p ) + Cj|/||i _ 1/xdx 


< 


Jp' v ' n (U p , Ep, Vp) + CNM log (^j 


2.5 


< ((K + LC)NM + mm(N,M) 2 )e 

whenever J p ' vr, {U p , E p , V p ) < ((KNM + min (N, M ) 2 ) e and p > e~ Le p 2 . Combining this with Theorem 
the following corollary follows directly. 

Corollary 2.7. Let e := 6 ((I\ + CL)NM + min(IW, TV) 2 ) -1 , then the following estimate holds for any small 
S > 0 and 7 + 2 v < ( K — 1 ) e, 

'mmJ^(U p ,£ p , p ,V p ) < ^ > 

where pn,m,s stated in (2.11) and p > e~ Le p 2 for some C, K, L. 


3 Multi-level analysis (MLA) of non-negative tri-factorizations 

In this section, we introduce a multi-level analysis (MLA) framework based on the aforementioned tri¬ 
factorization. We notice that, for a matrix Y, especially for those representing an image, there are features 
of different scales in Y which usually represent different objects in the image. We aim at extracting these 
features of different scales and represent them in a sparse low-rank approximation in terms of tensor products. 
Therefore we introduce a MLA framework to NMF which helps us achieve a sparse representation of the features 
of multiple scales, ranging from the coarsest scale up to the finest scale in the image Y. This MLA framework 
aims to identify the major components in the matrix Y which represent structures at multiple scales/levels of 
the image so that structures from large scales up to small scales in the image can be separately identified and 
sparsely represented. Our MLA framework is partially motivated by the MRA in wavelet analysis, which is 
widely use to capture different resolution of a function or image as well as for compression purpose. However, 
an essential difference of our MLA framework from the MRA lies in our hope to respect the positivity of the 
basis for the function/matrix approximation, but still obtain a similar multi-resolution property as in MRA. 

The most primitive idea of MRA is to successively approximate an L 2 -function by dyadic shifts and dilations 
of a wavelet function ip (a.k.a. the mother wavelet), which results in multiple resolution of the L 2 -function 
concerned. More precisely, we recall, e.g. in [5], that an MRA in wavelet analysis consists of a nested linear 
vector spaces, • • • C V\ C Vo C V-± C • • •, such that their union is dense in L 2 , and that they satisfy self¬ 
similarity conditions in both time and scaling as well as a regularity condition requiring the integer shifts of a 
piecewise continuous scaling function ip with compact support (a.k.a the father wavelet) shall form a frame for 
the subspace Vo Cl 2 . In the case of integer shifts on R, the above assumptions of nested linear vector spaces 
implies the following dilation equation, e.g. in there exists a finite sequence of coefficients Ck with |fc| < N 
such that 

N 

tp(x) = ^2 Ck<p{ 2 x ~ k ). 
fc=-AT 
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The mother wavelet can then be defined as 


N 

^(x) := ^ (-l) k d- k (p( 2 x - k ), 
k=—N 


and with this definition, one can easily render that V)_i = V) ® Wi for all l, where W) C Vj_ 1 denotes the closed 
subspace generated by the frame {tp(2~ l x — k) : k £ N}. Recursively, we can show that 

' > ip(2~ l x — k) : l,k€ Z} 


shall form a complete orthonormal base in L 2 (R) and that L 2 (R) = W). A similar result holds for higher 

dimension with a similar argument. 

However one can directly infer that the mother wavelet ^ has the property Enin] that 


ip{x) dx = 0, 


which directly implies that the function t/j can never have the same sign on the whole space. Therefore the 
approximation of an L 2 function by 

/ = ^ Cz.fc 1pl,k j 
i,ke z 

though acquiring the multi-resolution property, fails to be a representation of / by positive basis. This obser¬ 
vation that ijj does not have the same sign over the whole space is also true for higher dimension. Therefore 
it may be an undesirable feature if function / is positive, and when we hope to approximate the function with 
positive basis. This is the case when the function/matrix represents an image or a probability density function. 
This motivates us for a non-negative version of a similar multi-level approximation of the function based on the 
NMF technique, which we name as the multi-level analysis (MLA), in hope that each increasingly fine level of 
approximation of the function by positive basis shall represent an increase of resolution in some sense. 

In what follows, we give a mathematical framework for the MLA in NMF. For the sake of exposition, we 
introduce the following several operators which are very useful in the subsequent discussion. We first define an 
interpolation operator l s : Mmxn —■► Mm x jv as the following averaging operator: 


L s(y) ■— Xm r 2 S X] Ykl e i®ej 

l<i<M/r s ,l<j<N/r s kJeQ^.j, 


(3.1) 


where Qi u j j contains the entries iM/r s < k < (i + 1 )M/r s ,jM/r s < l < (j + 1 )M/r a . We note that this 
interpolation operator gives an interpolation between a fine space Hq := AImxn to a coarse space H s := Mm x jv , 
and the spaces H s actually forms a nested sequence of spaces, i.e. H s C Hi if s > l. One may actually define a 
more general nested sequence of spaces and interpolation operators, but for the sake of simplicity, we shall only 
discuss this averaging operator. Then we define If’p’ 1 : M^ xN —> Mm x n by 


T“> I/ >7 — ,T 0 ja, 7 
-k S,p ■ L S U -kp U u 8 


(3.2) 


The approximation I “jJ /,7 (Y’) represents the approximation of the (s max — s)-th level of the image Y by NMF 
where s max < [log(min(AT, M))/log(r)] and [•] denotes the floor function. Similarly, we define : M^x N 

Mmxn by 


✓y-a,Z/',7 


I I 


•a, 177 
P,P 


O 


(3.3) 


which serves as a truncated approximation of the (s m ax — s)-th level of Y. 

Now we are ready to investigate and analyse the error of the approximation given by this MLA framework. 
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In fact, it is easy to see by combining the arguments in previous sections and the Poincare inequality that 

II Y-lJ oI“b 7 o l s {Y )\\ 2 F2 < r 25 || is (F)-J“^o is (F )||^ 2 + ^||V 5 y /J ||| i2 

I,J 

< r 2 s J^(U p , E p , p , V p ) + J2 l|V«V/j ||| |2 

i,j 

< r 2 s J^(U p ,t p „V p )+r 2s ^Cr~ 2 s NM log ^ ||V 5 y /7 |||, 2 

where V 5 is the difference gradient operator defined as (VsY) i j = (^i+iy — ~ Yi,j)> the matrix Yjj 

are the (/, J)-th block of the Y matrices, [U p ,Y, p ,V p \ is an argument minimum of (2.1 1 with Y replaced by 
i a (Y) and E PiP is the truncation of E p up to p as stated in the previous section. 

Therefore if we can choose [U p ,'S p ,V p ] such that J p ' v, 1 fU p , E p , V p ) < r~ 2s ((KNM + mm(N, M) 2 ) e and 
p > e~ Le p 2 , then 


||y-2^P0||S , 2 < ((K + CL)NM + m.m(N,M) 2 )e + Y / \\V S Y IJ \\ 2 F} 

i,j 


2 • 


Let p r — s N,r— s M,s be defined as in (2.12). Then we know from the discussions in the previous section that the 
probability of the above event, denoted as E pp g, is bounded below by 

P (E Pt p tS ) > r~ s min(7V, M) N ’ r m,p t-m,s) f or 7 + 2v < (.K - 1 ) e . 

In general, we have no hope that either 11V^y1 2 or j II^ 7 iS^/jIIf 2 can he controlled, since we did not 
impose any regularity conditions for Y in general. However, if we further assume that Y has some regularity, 
for instance Yhi j < K 0 MNe , then 

I \Y ~ i:, p y(Y) 11 2 12 < ((A- + CL + Ko)NM + min (N, M) 2 ) e . 

Combining all the previous arguments and theorems then yield the following results. 

Theorem 3.1. Let e := —r 2s S ((AT + CL + Kq)NM + min(7V, M) 2 ) 1 , and for any small S > 0, E PtP j be the 
event such that the following inequality holds: 

\\Y-Kp:p( Y Y 2 f ,2 < ((K + CL)NM + mm(N,M) 2 )e + Y / \\V S Y I j\\ 2 Fi2 , 

I,J 


then if p is chosen such that p > e Ls p 2 , we have for any s and 7 + 2is < (K — 1) e that 


(J E PtPtS j > r s min(IV, M ) p 

\ V / 


r s max(iV,M) /^ s N,r s M,p r _ s 

'r-‘N,r~ s M,5 


(3.4) 


where the function and p r -°N,r- s M,5 are defined as in (2.6) and (2.11) respectively. For all s and 

p < r~ s min(IV, M), we have 


'{E p , p ,s) > y Smax(iV ’ M) - (/ - l) r Smax(JV ’ M) j l\ r~ s mm(N, M) (C 3 e) 
1=1 


fi(r s N,r s M,l ) 


(3.5) 


whenever p > e Le p 2 and 7 + 2u < ( K — 1) e. For a given small constant 77 , the growth of the summation 
above with respect to p can be controlled by 77 when p > K v p r -s N r -s M s f or some K v depending only on 77 . 
Furthermore, when the event E PtPi s happens and the inequality Y2i 7 1IV 5 I 7 j\\ 2 F 2 < KqMNe holds we have 


\Y-K':.p( Y )\\h< Y 


(3.6) 
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Now we can see from the above theorem that for a given threshold 6 and M = N, if Y has the regularity 
such that ||V 5 F|||, 2 < I\5 for some I\ < 1, then the lower bound of the probability of \\Y —Ifpp(Y)\\ 2 F2 < 6 
is higher than that of ||F — (^) I If ,2 < ^ with an appropriately selected p. Furthermore, for each s, the 

optimal choice of p has the same order as p r -s N r ~ s M,Si which behaves asymptotically like 


P 


r. —S/2, 


N 


log N — 2s log r ’ 


(3.7) 


with the memory complexity of 1%'p’p growing in the order r 3 ®/ 2 y lo „ N ^ s log r as ^ goes to infinity. This 
tells us that, by increasing s, the probability of fine approximation by NMF is increased as well as the memory 
complexity is decreased. Moreover, from numerical experiments, we can observe that the resulting approxima¬ 
tions from larger values of s capture the coarser features of Y, then achieve finer and finer features as s 

decreases. 


4 Semi-smooth Newton method for non-negative factorizations 


In this section, we propose and describe an efficient and cost-effective numerical algorithm to realise the 
NMF of the image or big data Y as we discussed in the previous sections. 

Instead of finding the optimal solution [U p , £ p , V p \ of the functional ©, we shall propose to perform the 
following alternative two-stage NMF to obtain an approximation of Ip ,v ’ 1 (Y) in two stages: 


Y « AV 1 


y t u t . 


then combine to get Y ~ UT,V 2 


(4.1) 


In each of the above two NMFs, we minimize the functional (1.2) via a semi-smooth Newton method based on 
primal-dual active sets j!3j , which will be derived below. The semi-smooth Newton method is more advantageous 
than some classical methods m 0 and converges faster. This two-stage process does not yield the optimal 
solution [U p , H p , V p ] of the functional (2.1), but generates an sufficiently fine approximation of I“ ,l/ ’ 7 (T) as we 
shall observe from our numerical experiments. More importantly, this two-stage process is more user-friendly 


and less expensive computationally, since the linearized systems of the functional (2.1) involved in the semi¬ 


smooth Newton iteration is much more convenient to evaluate numerically than the systems encountered when 


one minimizes (2.1) directly. 


4.1 Semi-smooth Newton method based on primal-dual active sets for NMF 

Before we present a two-stage NMF for an approximation of I^’ un (Y), we first discuss some mathematical 


properties of the important non-convex minimisation problem (1.2). The semi-smooth Newton method based 


on primal-dual active sets were proposed earlier in m to solve either convex or non-convex non-smooth opti¬ 
mization problems effectively by combining the ideas of active sets and Newton-type update. In this section, 


we formulate this method for solving the non-smooth non-convex optimization (1.2): 


min J(A, P) := \\Y — AP\\ 2 f0 + a 
A>0,P>0 ’ 


F, 1 + v\\P\\f,i + l\\PP T - 


(4.2) 


4.1.1 Complementary Conditions 

We first recall two complementary conditions for the characterization of some constraints conditions from 
m, which is crucial for the development of the algorithm in the subsequent analysis. For this purpose, we will 
need the sub-differential of the function | • | : R —> R, which is the set-valued signum function defined by 

! 1 if x > 0 , 

[-1,1] if x = 0 , (4.3) 

— 1 if x < 0 . 
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We shall also often require the following complementarity condition m which characterizes the set-valued 
sub-differential d\ ■ | by 


A = 


A + cx 

max(l, |A + cx\) 


for any given c > 0 , based on the following equivalence. 
Lemma 4.1. For any given constant c > 0, it holds that 


A = 


A + cx 

max(l, |A + cx |) 


•<=> A € d| • |(x). 


(4.4) 


(4.5) 


Proof. First we assume A = max^A+csl) • ^ IA + cx \ — 1) then A = A + cx, which gives x = 0, hence |A| < 1 
and A G d\ ■ |(x). For |A + cx\ > 1, we know A = jx+ffy = ±1. If A = 1, then |1 + cx| > 1, which directly gives 

x > 0, therefore A G d\ ■ |(x). The case for A = — 1 is similar. 

Now we assume that A £ d\■ \ (x) . If x = 0, then |A| < 1, therefore A = max (\ | A |) = ma x(i~ t |A+ca;|) • Furthermore, 

if x > 0, then A = 1 and |A + cx| > 1, therefore -—r\ = ttt^t = A. The case for x < 0 is similar. □ 

’ 1 l 7 max(l,| A+cx|) |A+ca;| 


Note that in the above complementary condition, the choice of c is arbitrary. However, in a practical imple¬ 
mentation using the complementary condition, c is often chosen as a fixed constant that acts as a stabilisation 
parameter. 

Now for any matrix A, we note that ||A||i?i = )Tb . |Aj i3 -|. Then the set-valued sub-differential function 
d[| ■ ||f,i(A) is given by 


f 1 if Aij > 0 , 

(0|Hki [4 )*,,■ = <[-1,1] ifA 4J =o, 

[-1 if Aij < 0 . 


Using the complementarity condition (4.1) for a dual variable A, we have 

^ = max^;Xl) " Aga|Hk i^- 


(4.6) 


(4.7) 


We may often write this simply as A = max ^~ t |^ ey 4 |) ; where the division, the maximum and the absolute value 
are all taken point-wise. 

Next we introduce a second complementary condition that is used to characterise an inequality constrain 
x > 0 [13] . We sketch the argument from [13] to motivate our desired complementary condition. For a functional 
F : M. N —> R, consider the constrained optimization: 


minF(x) subject to x > 0 . 


(4.8) 


We introduce its following equivalent augmented Lagrangian formulation with the same necessary optimality 
condition and a dummy variable z and a Lagrangian variable p,: 


minF'(x) + (fi, x — z) + - ||x — z \|| subject to x = z and 0 > 0 . 


(4.9) 


This functional is clearly convex in 2 . Minimizing it over 2 > 0, we obtain the following entry-wise necessary 
and sufficient conditions for z: 


Hi + c(xi — Zi) < 0 if Zi = 0 , 


or 


I Hi + c ( x i - Zi) = 0 if Zi > 0 
which gives the following unique minimizer for the variable z: 

Hi + cx 


0 > Hi + cx i if Zi = 0 , 
Zi = if Zi > 0 , 


Zi = max 0 


^ Hi + cxj ^ 


(4.10) 


(4.11) 
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Or we will also write it simply as z = max(0, ). Using this, we can directly compute 


(H,x-z) + -\\x-z\\l 

= 1 (p,mm(cx,-p)) + ^|| min(ca;, -p)\\l 

c 2c 

= ^ (||min(cx,-/z)+//]|^ - |Nl) 

= \c (N min (M + ccc,0)||1 - H/xlla) ■ 


Substituting this expression into the functional in (4.9), we obtain its equivalent minimization: 

minF(x) + — ^|| min(/i + cx, 0)111 - ||/tx|||) , 
whose necessary optimality conditions are given by the following set-valued equations: 


(4.12) 


(4.13) 


0 £ dF(x) + min(^ + cx, 0)<9min(-, 0) (min(/x + cx, 0)) , 
0 £ —p + min (p + cx, 0)i9min(-, 0) (min(/r + cx, 0)) . 


(4.14) 


Equivalently, by a point-wise comparison, we know min(/z + cx, 0)9min(-, 0) (min(/x + cx, 0)) = 0 if p + cx < 0. 
Then we see from the above necessary optimality condition that /i = 0 and min(/Lt + cx, 0) = 0, therefore 
p = min (p + cx, 0). On the other hand, if p + cx > 0, we obtain that min(/x-|-ca:, 0)<9min(-, 0) (min {p + cx, 0)) = 
min(/x + cx, 0). This, along with the necessary optimality condition, yields that p = min(p + cx, 0). Therefore, 
by combining the above two cases we arrive at an equivalent optimality condition for /Li, p = min(/i + cx, 0). 
This leads us to the following necessary optimality conditions. 


Theorem 4.2. 


The necessary optimality conditions for the minimization problem (4.8) are given by 


0 £ dF(x) + yi and /Lt = min(/Li + cx, 0). 


(4.15) 


The condition p, = min(/u + cx, 0) for the dual variable p is regarded as a complementary condition in [T51 . 
which serves as a characterization of the constraint x > 0. This complementary condition may also be regarded 
as a project of the solution to the convex set as the epigraph defined by the constraint. 


4.1.2 Necessary optimality conditions for the optimization (4.2) 

By directly applying Theorem |4.2| and calculating the sub-differentials involved, we come to the necessary 
optimality conditions for the optimisation (4.2) using the primal-dual variables (A, P, pa, pp) for a given ci: 


r0 £ d A J(A, P) + p A = 2APP T - 2 YP t +p A + ad\\■ ^(A) , 

I PA= min(/Lp 4 + ci A, 0), 

| 0 £ d P J(A, P)+p P = -2A t Y + 2A t AP + p P + vd\ \ ■ || Fjl (P) + 7 {d|| • \\ f ,i(PP t - /)} (P + T o P T o T) , 
{pp = min(/ip + CiP, 0) 


where T : Mmxn —> Mn x m is the transpose operator that maps A to A T . Now, applying hemnia flTI to the 
above system and introducing two more variables R, L, we obtain the following optimality conditions. 


Theorem 4.3. 


The necessary optimality conditions for the optimisation (4.2) can be given in terms of the 
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primal-dual variables (A, P, R, L, pa, Aa, Up, Ap, Al) and two constants Ci,C 2 by 


0 

Aa 

Ma 

0 

< A p 
L 
R 
A L 

ji P 


= 2 APP T - 2 YP t + p A + aX A 

_ _ Xa~\~C2A _ 

max(l,|A,4+C2A|) 

= min( / UA + c iA, 0) 

= -2 A t Y + 2 A T AP + pp + v\ P + 7 A l R 

_ Ap+C2P 

max(l,|Ap+C2P|) 

= pp t -1 

= P oT + T o P T oT 

_ _ \l~\-C2L _ 

max(l,|Ai / +C2-£/|) 

= min (pp + C\P , 0 ). 


(4.16) 


4.1.3 Semi-smooth Newton strategy 


We derived the necessary optimality conditions for solving the optimization problem (4.2 1 in the previous 


subsection. We shall now develop a semi-smooth Newton method for solving these optimality systems, which can 
be readily shown to be Newton differentiable |13j . To further develop our algorithm, we separate the variables 
(A, P, R, L, X A , Up, Xp, Ap) into three sets, i.e., (A, p A , X A ), (P, Up, Ap) and (P, P, Ap), and solve for each 
set of variables independently. Clearly, the separated systems are easier for us to perform active set techniques 
and greatly reduce the computational costs, and more importantly, each separated nonlinear system consists 
of much fewer variables, and is therefore much more stable when performing semi-smooth Newton iterations. 
With these motivations, we separate ( |4. 16 1 into three sets of equations: 

(1) For a fixed P, solve the system for ( A , p A , A a): 


I 0 

< A A 

l/M 


= 2 APP T - 2 YP t +PA + a\ A , 

_ _ Aa+C2 A _ 

— max(l,|AA+C2A|) ’ 

= min(/iA + ci A, 0). 


(4.17) 


(2) For the fixed A, L, P, Ap, solve the system for (P, pp, Ap): 

fo =-2A T Y + 2A T AP + p P + pX P +'yX L R 

J \ _ A P +c 2 P 

| P max(l,|Ap+C2-P |) 

[ /ip = min (/ip + CiP, 0). 


(4.18) 


(3) For a fixed P, solve the system for (L, P, Ap): 


L 

R 


Ap 


= PP T - I 
= PoT + ToP t oT 

_ AZ.+C 2 L 

max(l,|Ap+c 2 i|) 


Now we introduce the following active and inactive sets: 


Aa, 1 {(b i) • it^A^ij d - c\A^ j 0 } , 
Aa ,2 = {(bi) : | (Aa)^,^ + C2^4pj| < 1 } , 
A P ,i = {{i,j) ■ (pp)ij + dPij > 0 } , 
Alp, 2 = {(bi) '• |(Ap)jj + c 2 Pi,j\ < 1} , 
Alp = {(bj) : |(Ap )i,j +c 2 L id \ < 1 }, 


P-A,\ = {(bj) 

Pa ,2 = {(bi) 
Pp, 1 = {(bi) 
Pp,2 = {(bi) 

P L = {(bi) : 


( t L A)i,j + Cl Ajj < 0} , 

I (Aa)?;,j + c 2 Ai,j\ > 1} , 

(/^p)?,j P CiPj,j b 0} , 

l(Ap)i,j + C 2 Pi,j\ > 1} 7 
|(Ap)i,/ + c 2 L iy j\ > 1} , 


(4.19) 


then we can further reduce the previous 3 systems into the following much simpler ones thanks to direct 
substitutions and point-wise comparisons of the complementary conditions: 
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(4.20) 


( 1 ) For a fixed P, we have A = 0 on A a ,i U ^ 4 , 2 , while ( A , Ax) on Ia,\ H?a ,2 satisfies 

( 0 = 2 APP T - 2 YP t + a\ A , 

[0 = 1A^4 + C 2 Aj — (Ax + C 2 A) . 

(2) For the fixed A, L , P, Ap, we have P = 0 on Ap,i (J Alp, 2 , while (P, Ap) on Zp ssatisfies 

jo =- 2 A t Y + 2 A t AP + vXp + ~/X l R, 

Jo = Ap|Ap + C2PI — (Ap + C2P). 

(3) For a fixed P, we have L = 0 on Alp, while (Z, P, Ap) on Zp satisfies 

'l =PP t -I , 

P = P o T + T o P T o T , 

0 = Ap|Ap + C2-Z| — (Ap + C 2 Z). 


(4.21) 


(4.22) 


For the nonlinear constraints with Ax, Ap and Ap, we propose a semi-smooth Newton-step update as in (12) 
to solve the corresponding equations. One might suggest the explicit Uzawa iteration m instead, but it is 
only conditionally stable and converges slowly. We shall give only a sketch of the derivation of the semi-smooth 
Newton update, following the general principle in [ 12 ;. We first consider the system (4.201. Assume that (A, Ax) 
are perturbed to (A h ,X\) such that the increment is of order 0(h) and satisfies the second equation in ( |4.20| ). 
Then we can derive 


Ax|Ax + C 2 AI + Ax 


Xa + C2A 


. I Ax + C 2 AI 

which gives the following Newton update from (A, Xa) to (A + , A( 4 ): 


(A h A + c 2 A ft - Ax - c 2 A) - (X h A + c 2 A h ) = 0(h 2 ) 


^x|Ax + c 2 A| + A A 


Ax + C 2 A 


(A+ 4- C2^ + ) — Ax|Ax + c 2 A\ + (A(^ + c 2 A + ) . 


|Ax + c 2 A\ 

Now following m , we suggest the following Newton update involving damping and regularization: 


Aj(|Ax + c 2 A\ + 


Ax + c 2 A 


(A \ + c 2 A+) 


0X a 


— | A A + C 2 AI - 


0X A 


+ (Aj( + C 2 AA) 


|Ax + c 2 A| v a 'max(l,|Ax|) 'max(l, |Ax|) 

where 9 is a stability parameter and the regularizer Ax/max(l, |Ax|) is set to automatically restrict Ax to be 
in [-1, 1 ], Following [12], we set 9\X A + c 2 A\/ (jAx + c 2 A| - 1 + 9 ^ wllich S ives e < 1 
to simplify the iteration and leads to the following update after direct substitution: 

n _ „ 1 _ aAbA 4+ L 

0 — A a — C 2 — -~— A. &A 


dx-l 


where a a = 
first system 


mAAl) ’ bA ~ |Aa+c 2 A an< ^ ^ A = |Ax T C 2 AI, which is used as the semi-smooth update for the 


^A~\~C 2 A. 


(4.23) 


JO = 2 A+PP T - 2 YP t + aAj 
\o = ^x - afix i 1 “ a A^x) A+ + ax • 

We can linearize the constraints for the variables Ap and Ap similarly. 

We may solve the third system (4.22) for the other two variables (L, P), but it is actually not an easy job. 
Although the second equation in (4.22) is linear, it is computationally expensive as the transpose operator 
T is involved. We therefore derive a semi-smooth Newton update for P from L and P instead of a direct 
substitution. Assume (Z,P, P) are perturbed to (L h ,R h ,P h ) such that the increment is of order O(h) and 
satisfies L = PP T — I , we then have 


(. L h — L) = R h (P h - P) + 0(h 2 ), 
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which suggests the following update for R: 

P+(P+-P) =(L+-L). 

Combining this update with the aforementioned strategy for A l, we obtain the following semi-smooth Newton 
update from (L,P,P) to (L+,P + ,P+) for the third system (4.22): 


(L+ = P+(P+) T - /, 

ip+(P+-P) =(L+-L), 

Ui =^(l-a L %)L+-a L 


(4.24) 


where a L , b L and d L are given by a L = , b L = ^l+c 2 2 l\ and d L = \X L + c 2 L\. 


4.1.4 Numerical algorithms 

Combining all the techniques and results from the previous subsections, we are ready to propose the semi¬ 
smooth Newton method based on primal-dual active sets for solving the optimality system (4.16) to tackle the 
minimization problem (4.2). 

Semi-smooth Newton Algorithm 1 . Given two constants Ci,C 2 ; initialize (A 0 , P°,/i^, A^,/ip, Ap, A°). 
For k = 0,1,..., K , do the following steps : 

1. Compute /4 fc) := -2 A^P^P^ + 2Y(P (k '>) T - a\^ ] . 

2 . Set the active and inactive sets A Ai and I A i for i = 1 , 2 : 

a a\ = {(hi) : + ciA-j > 0} , 1 ( A \ = {( i,j ) : (ma)-j + ciAf] < 0} , 


-4A,2 — {(hi) : I (^A)ij + C 2 A^| < 1} , 




r(^) 

2 


= ■ \(Xa)^ + c 2 A^\>1}. 


(fc)i 


3. Compute a A \b A \d A ' > : 


Xk) 


W A k ) — + c 2 A ik '> ( fc) | (fe) .(k) | 

’ ,.(k) m’ i c 2 ^- |- 


max(l, |A^|) ’ |A^ + c 2 ^^T 


4. Set A^ fc+1 ) := 0 on LMa 2 > solve the system for (Al fe+1 ), A^ +1 ^) on I A \ D^-a 2 : 

'0 = 2 A( fe+ 1 )pWp! fe ) r - 2 F(Pl fc )) r + oA^ +1) , 

0 = A a + 1) - ^ (/ - 4 fc W] T ) ^ (fc+1) + af . 

5. Compute ^ := 2{A^ k+1 ^) T Y - 2 (A^+^A^+^pW - vX { P ] - 7 A R^ . 

6 . Set the active and inactive sets A Pi and I Pi for i = 1 , 2 : 


■4*1 = {(*, J) = MTJ + ciP^ > 0} , = {(», j) : MTJ + CiP%> < 0} , 

A [ p\ = ■■ l(Ap)j? + c 2 P^\ < 1} , p(fc) 


X k ) 




r(^) 


\(*0 


}(k) 


= : \(X p )%+c 2 P%\>1}. 


7. Compute Op^, bp ' 1 , dp J : 


( fc )_ Ap } (fc)_Ap } + c 2 P (fc) (fe)_| (fc) p(fe)i 

p ■“-n u(fc)n ’ bp 7V(*o )— dp ‘ _|Ap + 2F !■ 


dfc) 


Cl p ’ — 


max(l, |Ap^|) 


lA^+caPWr 
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8 . Set p( fe+1 ) := 0 on Ap\ (J Ap\ ; solve the system for (p( fe +l} ; Ap +1 ^) on Zp J fl^P 2 : 


0 = -2 (A( fe+1 )) r y + 2(A( fc+1 )) T A( fc + 1 )P (fc+1 l + !/A^ +1) + yA^P^ 

0 = Ap +1) - (/ - a^[6^] T ) P( fe +!) + ap). 


9. Set the active and inactive sets _4p fc) and 


4 fc) = {(*> j) ■■ I (A l)£ } + c 2 zg>| < 1} , zf = (OU) : |(A t )g + caZg | > 1} . 


10. Compute Op \ \ dp ■* : 

«f=- := .41! + 01 ■ II!, 4 l, :=IAg>+o a L<‘)|. 


max(l, |A^|) 


|Ap } + c 2 Zl fc )| 


11. Set Z (fc+1) - 0 on A {k) ; evaluate (Z( fc+1 ), p( fe+1 ), A^ fe+1) ) on X { L k) : 


r z( fe+i ) 

I jj(fe+i)^p(fc+i) _ p(fc)^ 
| Ap +1) 


= p(fc+i)(p(fc+i))r _ j ^ 

= (L( fe+1 ) - p( fc l) , 


iI r I (/-4‘ , [f'? ) ] T )r l ‘ +1) 



A natural choice of the stopping criterion is based on the changes of the active sets: if the active sets for two 
consecutive iterations are the same, we may stop the iteration m- As the iteration goes on, A , P, L become 
more and more sparse, and the sizes of the linear systems involved drop drastically, so the inversions of the 
linear systems are more stable and less expensive computationally. 

Finally, a few remarks are in order for effective implementations of the algorithm: 

1. With the enforcement of the constraints A, P > 0 by the dual variables ^^,/ip, the algorithm ensures 
naturally A^,PW > 0 for all k if the initial guesses and Pare set to be non-negative. Thus the 
algorithm can be simplified by setting the dual variables A^ and Ap ^ to be A ^ = Ap ' 1 = 1 and drop the 
active/inactive sets A^\, X^\, Ap\ and Zpp 

2. In order to further simplify the algorithm, we may normalize the row vectors of P after Step 8 so that 
PP T has unitary diagonal entries. If this normalisation is added, then L^ > 0 for all k. In this case Ap^ 
can be simply set to be X^ k> = 1 while A^ and Zp ' ) can be dropped. 

3. In the development of our algorithm above, we assume Y > 0 entry-wise, therefore it is natural to enforce 
the constraint A > 0. This non-negativity condition for A is however infeasible and shall be dropped if 
Y is not non-negative entry-wise. In this case, nonetheless, we can still utilize the above algorithm for 
a non-negative factorization with the following minor modification: drop the dual variable p ,a and the 
active/inactive sets A^\ and X^\. 


4.2 Non-negative matrix factorization of an image 


With Semi-smooth Newton Algorithm 1 to minimize the functional (4.2), we are ready to propose an 
algorithm to approximate Z“’"’ 7 (F) in (2.2) and Z“T ’ 7 (Y) in (2.20) for the NMF of an image Y . 

Non-negative Matrix Factorization Algorithm 2. Specify 5 parameters a, v, 7 , p. p. 

1. Apply Semi-smooth Newton Algorithm 1 to find a minimizer [Aq,Fo] of the problem: 


min ||Y - AV 1 \\ F2 + a 
A>o,v>o ’ 


F,i + v\\V\\ F , 1 +'r\\V T V-I\\ F , 1 . 
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2. Apply Semi-smooth Newton Algorithm 1 to find a minimizer [E (l , U t) of the problem: 

min |K-S T t/ T |||, 2 + a||E|| F , 1 + H|?7|k,i+7ll^ T t/-/||F,i. 

2j->U,U ^U 

3. Form 2™(F) := t/ 0 E 0 l/ 0 T from [U 0l 2 0 , Vo] • 

4. Sort the entries of Eo from the largest to the smallest as > cr,; 2:)2 > .. > cq 2 j 2 . 

5. Compute cq := a ll j l e; g) e;, then form Eo,p := ■ 

6. Form the factorisation Ip'-’ 1 {Y) := . 

4.3 Multi-level analysis algorithm based on NMF 

Based on the results from a NMF, we can propose a multi-level analysis algorithm. 

Multi-level Analysis Algorithm 3. Specify a scaling parameter r and a constant s m ax such that 
Smax 7 log A/logr, set paiameteis n, zz, y and 2 airays of parameters [p(l), • ••,‘p( K Smax )], W),-,P(s max)]- 
For s = 1,2,..., Smax, do the following steps: 

1. Compute l s {Y) as in ( |3.1| ). 

2. Calculate by Non-negative Matrix Factorization Algorithm 2. 

3. Calculate := o , S (F). 

5 Applications to photo images, EIT and DOT images 

In this section we shall apply both the NMF and the MLA framework of a NMF suggested in Section[f] 
to some photo images and several EIT and DOT images reconstructed by some direct sampling methods. We 
shall investigate two applications, the first one being an MLA for photo images using NMF, and the second one 
being an NMF over the images from an inversion algorithm for a broad class of coefficient determination inverse 
problems. In the first application, we aim at capturing features of different scales in an image and obtain 
a sparse low-rank representation of these features; while in the second application, we hope to identify the 
principal components in the image, which correspond to the signals coming from the inhomogeneous coefficients 
to be determined in the corresponding inverse problems, and remove artifacts and noise from the images. 


5.1 Applications to photo images 


We shall now perform an MLA using NMF for several grey-scaled images Y. In view of the fact that an 
image can be represented by a positive function, and so are the major structures/objects inside these images, 
we are naturally motivated to use the NMF to identify the principal components of the image corresponding 
to these major objects in the figure, and obtain a sparse representation of these objects and structures. MLA 
is employed to obtain these corresponding principal components representing structures/objects at multiple 
scales/levels of the image, so that structures of large scales and small scales in the image can be separately 
identified and sparsely represented. We shall also aim to obtain a sparse representation which is robust to noise 
during transmission of data through channels. But we would like to emphasize that we are neither aiming at 
reconstructing the image in full entity from all the NMF components in terms of tensor products, nor hoping to 
obtain a very high compression ratio of memory complexity to defeat any well-developed compression techniques, 
e.g. wavelet/curvelet compression, JPEG etc, since they are surely better candidates for compressions. Our 
major purpose is instead to identify and keep structures in the images in a robust manner. 

In the subsequent 4 examples, we shall utilize the Multi-level Analysis Algorithm 3 to approximate . (Y), 

in which the Non-negative Matrix Factorization Algorithm 2 is used to calculate Z’^’’[t s (T)] and the Semi¬ 


smooth Newton Algorithm 1 is used to minimize (4.2 1 for the NMF. In all the following examples, the parameters 
in Algorithm 3 are set to 

r = 2 , a = 0 . 2 , v = 0 . 02 , 7 = 0 . 02 , 
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whereas s max is set differently in each example. Considering the theoretical asymptotic order for an optimal 
choice of p as in (3.7), the array of parameters p(s) is set to 


p(s) = 


max(iV, M) 


max(l, logmax(7V, M) — 2s log r) 


,-«/2 


in all our examples, where [■] denotes the round-off function and K\ is a given constant. We observe from 
numerical experiments that this asymptotic formula (3.7) is, on one hand, necessary for good approximation of 


the desirable structures we hope to identify, and on the other hand, grows fairly slowly as the value s max —s grows 
and henceforth is a practical choice and very desirable for feature identifications and sparse representation. To 
ensure that the fidelitjyof the most important features in the image can be kept after dropping the less important 
components from the E PjP , the parameter p(s) is chosen by a threshold based on the lA-norm of E p , i.e. as the 
first integer such that 


p{s) p(s ) 2 

a idi > -^2 a ii.n > 
(=i i=i 


where K 2 is a threshold which is smaller than 1. In all the following examples, K\ 
as K\ = 3.5 and K 2 = 0.95. A quantization process 

.-■) ■= U 


which we get from Algorithm 2 as 


0.01 


and K 2 are always chosen 
is performed on all the three matrices [U p , E PiP , V p ] 
for any matrix (A, :? ). This is to minimize the number of 
possible choices of values in the matrix entries in order to embrace a possibility for an efficient entropy coding 
post-processing after the NMF process and minimize memory complexity. The parameters Ci,C 2 in Algorithm 
1 are always set to 1 . 

For the sake of comparisons between feature extraction, sparsity of representation and robustness against 
noise in the transmission channel, we shall also compare the performance of NMF with the ones by the SVD 
and the JPEG compression process. For any given image Y, the SVD with the level parameter s, Isvd,si is 
taken directly as 


Isvd,s :=l T s(UYV t ), where l s {Y) = UZV T . (5.1) 

Again, the same quantization process Q is performed on the three matrices [U, E, V] as described above to 
embrace a possibility for efficient entropy coding. Meanwhile, for the JPEG compression format, we follow 
the standard routine as in [23]. Namely we first perform a discrete cosine transform (DCT) on 8 x 8 pixel- 
blocks to give the DCT coefficients (Dy) on each block, then perform the standard JPEG quantization process 

with the given standard JPEG quantization matrix Q 50 (with quality Q = 50) [23]: 


Cij = 


Djj 

(.Quo) 


16 

11 

10 

16 

24 

40 

51 

61 

12 

12 

14 

19 

26 

58 

60 

55 

14 

13 

16 

24 

40 

57 

69 

56 

14 

17 

22 

29 

51 

87 

80 

62 

18 

22 

37 

56 

68 

109 

103 

77 

24 

35 

55 

64 

81 

104 

113 

92 

49 

64 

78 

87 

103 

121 

120 

101 

72 

92 

95 

98 

112 

100 

103 

99 


A level parameter s is introduced to define the image Ijpg,s as the reconstruction of the JPEG from only the 
first 2 3-s Fourier coefficients in each 8 x 8 pixel-blocks for s = 0,1, 2, 3. Note that, with this definition, only 4 
levels are available for JPEG. 

In order to test the robustness of the algorithms for feature preservation during the transmission process 
of data through channel, multiplication noise is added to simulate the scenario of data transmission through 
a noisy cable for each of the aforementioned algorithms, i.e. NMF, SVD and JPEG. For the NMF process, 
multiplicative noise is added to the three matrices [U p , S PiP , V p ] after quantization as 

(D p % = 0M 1 + vCij ), (E^)« = (S p ,p)«(l + 1 + , (5-2) 
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where l?™- (a) [i a {Y)\ := U P Y P}P V^ , l^ ) p{s) (Y) := i T a o1“’“£-^ol 8 {Y), a is the noise level and C is uniformly 
distributed between [—1,1]. Noisy reconstruction from the NMF is then given by 


rj-OL,V,^ 

s,pO),pO) 


on ,= 


(5.3) 


Similarly, for the SVD process, multiplicative noise is added in [U, E, V] after quantization such that 

• <-/): >:j,- = n,d +<o)- v^ = v tJ (i +<,), (5.4) 

where Isvd,s '■= ^ ( UY,V T ) and t s (F) := UY,V T . The noisy reconstruction Ig VD s is then taken as 

4vd, s ■■= l T s(U^(V<) t ). (5.5) 

For the JPEG process, multiplicative noise is added in DCT coefficients on each 8 x 8 pixel block after quanti¬ 
zation: 


C;, -C.,!' • rr,',,). (5.6) 

and the noisy reconstruction Ij PG s comes as the de-quantization of by multiplication by Q 50 followed by 
an inverse DCT. In all our numerical examples, we always set the noise level to be a = 25% 

The relative error of the reconstruction image / reconst from each reconstruction method is quantified in the 
following manner on the quotient space of L 2 after taking an affine equivalence: 

T \ _ m in a5 fr£]j£ H^Aeconst T 5 y|| ia 

£(,r recons tJ :— 11vri| 

ll y \\l 2 


This measurement of error is adopted because all the reconstructed images are shown such that the color scale 
gives only the relative contrast of the gray scale, and therefore an affine equivalence is taken for an appropriate 
measure of relative error. For each image, we shall also measure the memory complexity ratio of a given method, 
which is given as the ratio between the memory size of the data after performing the corresponding method 
and that of the original data. We would like to remark that the memory complexities for all the three methods 
(including JPEG) in our examples are computed based on its size before entropy coding; meanwhile, a same 
entropy coding technique can be applied to all the three methods considering the fact that all of them have 
undergone a quantization process. 

Example 1. In this example, we set Y as the grey-scale image presented in Figure [l] The parameter s max 
is chosen as s ma x = [log(min(iV, M))/log(r) — 3]. The resulting images from MLA without noise are shown 
in Figure [2] whereas reconstructions with 25% noise are given in Figure [3j The memory complexity ratios for 
the (Smax — s)-th level of the three methods and their respective relative L 2 errors with and without noise are 
shown as follows: 


°max 5 

1 

2 

3 

4 

5 

6 

P 

20 

24 

24 

28 

34 

42 

P 

142 

177 

152 

153 

195 

332 

memory complexity ratio of NMF 

0.0017 

0.0033 

0.0061 

0.0116 

0.0271 

0.0573 

memory complexity ratio of SVD 

0.0015 

0.0036 

0.0072 

0.0168 

0.0409 

0.1011 

memory complexity ratio of JPEG 

NA 

NA 

0.0154 

0.0497 

0.0982 

0.1048 

Relative L 2 error in NMF (with 0% noise) 

0.2723 

0.2567 

0.2350 

0.1878 

0.1630 

0.1561 

Relative L 2 error in SVD (with 0% noise) 

0.2733 

0.2584 

0.2342 

0.1875 

0.1591 

0.1594 

Relative L 2 error in JPEG (with 0% noise) 

NA 

NA 

0.1855 

0.0974 

0.0689 

0.0535 

Relative L 2 error in NMF (with 25% noise) 

0.2768 

0.2631 

0.2462 

0.2029 

0.1770 

0.1689 

Relative L 2 error in SVD (with 25% noise) 

0.2755 

0.2629 

0.2456 

0.2029 

0.1754 

0.1704 

Relative L 2 error in JPEG (with 25% noise) 

NA 

NA 

0.1941 

0.1089 

0.0711 

0.0673 


We can see from Figure [2] and [3] that in the absence of noise, although it is true that the NMF does not 
outperform SVD and JPEG of the same level, many reasonable details of different scales can already be captured 
in different levels of NMF, starting from the coarser image of the horse, then finer details and afterwards the 
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clear black-and-white strips on the horse. In each level, JPEG gives the best image of the three, however, it also 
needs a relatively high memory complexity in the same level. Meanwhile the NMF provides a representation 
of a relatively low memory complexity of the same layer. It is especially interesting to note that a memory 
complexity ratio of about 0.01 (before entropy coding) at level 4 can already give us many details of the horse. 
With the presence of noise, we can see that although the relative L 2 errors of both NMF and SVD are more or 
less the same, many coarser layers of SVD are not free from the contamination of noise in the form of vertical 
and horizontal strips in the background, and that the NMF gives a better shape of the horse. The NMF layers 
are affected by noise, but most of the nice details of the horse can still be kept. The JPEG stays the most 
robust against the noise, nonetheless, considering the fact that NMF of the same layer usually requires less than 
half of the memory as JPEG, the performance of NMF is already quite reasonable. 


Orignal 



Figure 1: Original image in Example 1 
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Figure 2: ML A for the image in Example 1 using NMF without noise 
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NMF: Level 5 


SVD: Level 5 


JPEG, Q=50: Level 5 
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JPEG, Q=50: Level 6 



100 200 300 400 500 600 700 800 900 


Figure 3: MLA for the image in Example 1 using NMF with 25% noise 
Example 2. In this example, we set Y as the image presented in Figure [4] The parameters are the same 
as in the previous example. The resulting images are shown in Figure [5] The memory complexity ratios for 
the ( s max — s)-th level of the three methods and their respective relative L 2 errors with and without noise are 
shown as follows: 


Smax S 

P 

P 

memory complexity ratio of NMF 
memory complexity ratio of SVD 
memory complexity ratio of JPEG 
Relative L 2 error in NMF (with 0% noise) 
Relative L 2 error in SVD (with 0% noise) 
Relative L 2 error in JPEG (with 0% noise) 
Relative L 2 error in NMF (with 25% noise) 
Relative L 2 error in SVD (with 25% noise) 
Relative L 2 error in JPEG (with 25% noise) 


1 

2 

3 

4 

5 

22 

21 

23 

28 

34 

143 

113 

118 

146 

208 

0.0047 

0.0075 

0.0140 

0.0309 

0.0711 

0.0053 

0.0103 

0.0225 

0.0548 

0.1330 

NA 

0.0155 

0.0364 

0.0619 

0.0716 

0.2945 

0.2749 

0.2503 

0.1966 

0.1693 

0.3024 

0.2770 

0.2553 

0.2075 

0.1717 

NA 

0.2487 

0.1827 

0.0884 

0.0663 

0.3104 

0.2842 

0.2614 

0.2225 

0.1899 

0.3040 

0.2867 

0.2536 

0.2298 

0.2024 

NA 

0.2664 

0.2025 

0.1250 

0.1082 


From Figures [5] and [6j finer and finer details are reasonably captured and present as the level number of the 
NMF layers increases, while a reasonably low compression ratio is attained. This time the memory complexity 
of JPEG becomes comparable to NMF. In each level, JPEG still gives the best image of the three on the same 
layer, however, we notice that with the same level of memory complexity, some of the NMF images can provide 
a finer layer of detail than the other two methods. With the presence of noise, we can see that although the 
relative L 2 errors of NMF actually outperform the ones of the SVD in some layers, the figures of all the three 
methods seem to be seriously contaminated. However, to our surprise, it seems that the figures of NMF seem 
more robust to keep the background clean, while the figures of the SVD are contaminated by random strips 
whereas the JPEG by random squares. In the coarsest levels, the SVD does not give a shape of a table, however, 
the NMF images still give a recognizable shape of a table. Moreover, the most detail of the table in the finer 
levels is still reasonably kept by the NMF in the presence of noise. 
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Figure 4: Original image in Example 2 
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Figure 5: ML A for the image in Example 2 using NMF without noise 
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Figure 6: MLA for the image in Example 2 using NMF with 25% noise 
Example 3. In this example, we use the same set of parameters as for the previous two examples except 
that we set s max = [log(min(IV, M))/ log(r) — 4] instead. Y is set as the image in Figure[7j The resulting images 
are shown in Figures [8] and [9j The memory complexity ratios for the ( s max — s)-th level of the three methods 
and their respective relative L 2 errors with and without noise are shown as follows: 


Smax S 

1 

2 

3 

4 

5 

V 

24 

24 

28 

34 

43 

p 

134 

129 

89 

120 

187 

memory complexity ratio of NMF 

0.0032 

0.0056 

0.0118 

0.0267 

0.0595 

memory complexity ratio of SVD 

0.0042 

0.0085 

0.0199 

0.0483 

0.1223 

memory complexity ratio of JPEG 

NA 

0.0155 

0.0480 

0.1383 

0.2005 

Relative L 2 error in NMF (with 0% noise) 

0.4060 

0.3814 

0.3588 

0.3391 

0.3113 

Relative L 2 error in SVD (with 0% noise) 

0.4075 

0.3818 

0.3633 

0.3386 

0.3112 

Relative 1? error in JPEG (with 0% noise) 

NA 

0.3542 

0.3062 

0.2120 

0.1502 

Relative L 2 error in NMF (with 25% noise) 

0.4164 

0.3901 

0.3760 

0.3505 

0.3310 

Relative L 2 error in SVD (with 25% noise) 

0.4112 

0.3912 

0.3745 

0.3499 

0.3346 

Relative L 2 error in JPEG (with 25% noise) 

NA 

0.3573 

0.3058 

0.2163 

0.1605 


We can see from Figure [8] that in the absence of noise, although JPEG again performs the best among the 
three on the same layer, it requires usually about 4 times of the memory than NMF due to the complexity of 
the figure. If we pick a memory complexity ratio of around 5 to 6 percent, then we can choose an NMF of the 
5-th level, while we can only choose a level 3 among the JPEG images which provides much less finer details 
of the building. With the presence of noise, the relative L 2 errors of the JPEG is the least among the three as 
shown in the above table. Nonetheless in Figure |9j we actually notice that the several NMF layers do not seem 
quite different from the ones without noise, whereas the SVD and the JPEG images are obviously contaminated 
respectively by straight strips and random squares. 
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Figure 7: Original image in Example 3 
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Figure 8: MLA for the image in Example 3 using NMF without noise 
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Figure 9: MLA for the image in Example 3 using NMF with 25% noise 
Example 4. In this last imaging example, the parameters are the same as in Example 3. Y is set as 
the image in Figure flO| and the resulting images are shown in Figure [IT] The memory complexity ratios for 
the ( s max — s)-th level of the three methods and their respective relative L 2 errors with and without noise are 
shown as follows: 


Smax S 

1 

2 

3 

4 

5 

V 

24 

24 

28 

34 

43 

p 

173 

74 

52 

34 

73 

memory complexity ratio of NMF 

0.0033 

0.0059 

0.0116 

0.0283 

0.0600 

memory complexity ratio of SVD 

0.0038 

0.0076 

0.0177 

0.0430 

0.1089 

memory complexity ratio of JPEG 

NA 

0.0156 

0.0297 

0.0609 

0.0753 

Relative L 2 error in NMF (with 0% noise) 

0.4766 

0.4283 

0.3673 

0.3109 

0.2808 

Relative L 2 error in SVD (with 0% noise) 

0.4763 

0.4239 

0.3638 

0.3133 

0.2813 

Relative L 2 error in JPEG (with 0% noise) 

NA 

0.3994 

0.2753 

0.1472 

0.1079 

Relative L 2 error in NMF (with 25% noise) 

0.4813 

0.4344 

0.3769 

0.3311 

0.3011 

Relative L 2 error in SVD (with 25% noise) 

0.4832 

0.4362 

0.3791 

0.3311 

0.3038 

Relative L 2 error in JPEG (with 25% noise) 

NA 

0.4221 

0.3172 

0.2203 

0.1967 


From this table we can see that, on the same layer, SVD always needs about a double of the memory than the 
NMF to just have a similar performance. Again, from Figure[lT] we infer that JPEG outperforms the other two 
methods at the same layer in the absence of noise. Nevertheless, if we choose a same memory complexity ratio 
e.g., 1.5 percent, we can actually get a 3rd layer of the NMF but only a 2nd layer of JPEG, and the relative error 
of the smaller-sized 3rd layer of NMF is actually smaller than the larger-sized 2nd layer of JPEG. Moreover, as 
we can see from Figures [Tl] and [l2j when the layers increase and finer details reveal, a level 4 of NMF is enough 
to read the Chinese characters which requires less than 0.03 percent of memory complexity. With the presence 
of noise, the relative error of the 4th layer of NMF where the Chinese characters are recognizable becomes 
comparable with the 3rd layer of JPEG, while their memory complexity is the same. Many of the NMF figures 
have less errors than the SVD figures on the same layers while the memory complexities of SVD are actually 
larger. Again, in Figure[l2j the SVD and the JPEG images are obviously contaminated respectively by straight 
strips and random squares, whereas the noise contamination in the NMF layers seem less obvious. 
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Figure 10: Original image in Example 4 
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Figure 12: MLA for the image in Example 4 using NMF with 25% noise 

5.2 Images reconstructed by DSMs 

In this subsection, we shall present the application of NMF to the images reconstructed by some recently 
developed inversion algorithms, namely the direct sampling methods (DSMs). The DSMs are a family of simple 
and efficient inversion methods which aim at providing a good estimate of the locations of inhomogeneities inside 
a homogeneous background representing various physical media from a single or a small number of boundary 
data in both full and limited aperture cases. They were studied in m m using far-field data and in [14] using 
near-field data for locating inhomogeneities in inverse acoustic medium scattering, and was later extended to 
various other coefficient determination inverse problems, such as the electrical impendence tomography (EIT) 
[3j, the diffusive optical tomography (DOT) [2] and the electromagnetic inverse scattering problem [Tlj. In each 
of the aforementioned tomographies, a family of probing functions is introduced and an indicator function is 
defined as a duality product between the observed data and the probing function under an appropriate choice 
of Sobolev scale. The index function, which we shall denote as a general image Y, represents the likelihood of 
whether a given sampling point sits inside an inhomogeneous inclusion. The evaluation of the index function 
is very inexpensive and works with quite limited measurement data, and the images obtained from the index 
functions are proven to be effective in locating abnormalities, especially robust against noise in the data. 

However, from our numerical experiments in the aforementioned references, we notice that, in exchange for 
the robustness of the DSM method and the cost-effectiveness of its evaluation, the DSM images usually contain 
some minor artifacts. These artefacts mainly come from the fact that the DSM image is actually the result of 
applying a kernel on a function with its support sitting inside the inclusions that we aim to locate. The DSM 
images we obtain are therefore usually quite diffusive and may consist of shadows and tails coming from the 
non-diagonal part of the kernel. 

Henceforth, a DSM image Y shall consist of 3 parts: the first part coming from the signals of the inho¬ 
mogeneous inclusions, the second as the contamination of the image by the non-diagonal part of the kernel, 
and the third part coming from noise in the measurement data. In view of the fact that both the DSM image 
that we obtain and a likelihood function are both positive, we shall therefore apply the NMF to the DSM 
images in the hope of identifying the principal components of the image corresponding to the signal from the 
inhomogeneous inclusions. As a remark, we would like to emphasize again that we are not aiming to reconstruct 
the original DSM image from all the components (in terms of tensor products) that we obtain from NMF, but 
only to look for principal components of the image containing signals from inhomogeneous inclusions and aim 
at reconstructing the inclusions themselves. 

In what follows, we shall apply the NMF to DSM images from two tomographies, namely the DOT and EIT. 
DOT is a popular non-invasive imaging technique that measures the optical properties of a medium and creates 
images which show the distribution of absorption coefficient inside the body. It is very useful for medical imaging, 
e.g. breast cancer imaging, brain functional imaging, stroke detection, muscle functional studies, photodynamic 
therapy, and radiation therapy monitoring; see [2], In our subsequent discussion, we consider the numerical 
experiments of the DOT using DSM as in Section 6 of [5], and the same numerical setting described therein. 
The medium coefficient inside all the inhomogeneous inclusions are set as /z = 50. The images generated from 
the scattered potential using the DSM algorithm described in that work are then put into Algorithm 2 for 
NMF, with parameters set to a = 0.2, v = 0 ,7 = 0.02, p = 5,p = 3 and c\ = C 2 = 1 in all the following examples. 

Example 5. In this example, we consider the case of two circular inclusions of radius 0.065, which are 
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respectively centered at (—0.5,0.25) and (0.25,0.15); see Figure 13 (top). The squared reconstructed images 
from the index / 2 described in j5J is presented in Figure 
for l = 1,2,3 after NMF obtained in Algorithm 2 are s 


13 


(second). The three images a il j l ( u p ) i( (g> (vp)j n 
rown in Figure 13 (third to fifth). The generalized 
eigenvalues are respectively given as {<J H j n }f =1 = {32.5522,21.1686,12.8299} in this example. The squared 
image of the final approximation to (I) after normalization is given in Figure 13 (last). From the figure, 

we can see that with an appropriate cutoff, e.g. a 50% cutoff, both the sizes and locations of inhomogeneities 
obtained from the image are reasonable accurate. 

Example 6. This example tests a medium with 4 circular inclusions of radius 0.065 with their corresponding 
positions: (—0.5,0.3), (—0.3,—0.1), (0.3, 0.1) and (0.5, 0.3); see Figure 14 (top). Figure [l4] (second) shows the 
squared reconstructed images from the index 1 2 described in [2]. Components cr,^, ( u p )j, <g) ( v p )j ,, for l = 1, 2, 3 
after NMF are shown in Figure 14 (third to fifth). The generalized eigenvalues are respectively given as 
{ a it ji }'?=! = {22.2455,16.7153,8.9511} in this example. Figure 14 (last) gives the squared image of the final 


approximation to Z“T’ 7 (/) after normalization. The principal components of the image coming from signals 
from the inclusions can be well obtained, with an observation that the first two components decomposed from 
NMF actually represent the inhomogeneous inclusions inside the original medium, 
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Figure 14: NMF decomposition of DSM image from DOT in Example 6 , with }f =1 = {22.2455,16.7153,8.9511} 
Next, we shall apply the NMF to the DSM images from EIT, which is an effective noninvasive evaluation 
method that creates images of the electrical conductivity of an inhomogeneous medium by applying currents at 
a number of electrodes on the boundary and measuring the corresponding voltages. It has found applications 
in many areas, such as oil and geophysical prospection, medical imaging, physiological measurement, early 
diagnosis of breast cancer, monitoring of pulmonary functions and detection of leaks from buried pipes, etc; 
see ref. in [3] . In what follows, we consider the same numerical setting as in the numerical experiments of EIT 
for a circular domain using DSM described in Section 6 in . The physical coefficient of the inhomogeneous 
inclusions are all set to a = 5. The images generated from the scattered potential field using the DSM algorithm 
are then put into Algorithm 2 for NMF, with parameters set to a = 0.2 ,is = 0 ,7 = 0.02 ,p = 5 ,p = 3 and 
Ci = C 2 = 1 in all the following examples. 

Example 7. We now investigate an example with 2 inclusions of size 0.1 x 0.1 respectively at the positions 
(—0.44,0.36) and (0.36, —0.44); see Figure [l5| (a). The squared reconstructed images from the indices I 2 after 
normalization as described in |3j is presented in Figure[l5](b). The components ayy, (TL)^ ®(vp)jn f° r ^ = 1,2,3 


obtained from NMF using Algorithm 2 over the image I are shown in Figure 15 (c-e). The generalized 
eigenvalues are respectively given as {oyj, }f^ t = {2.3712,2.3548,2.2904} in this example. The squared image 
of the approximation to 2 ipT , 7 (J) after normalization is in Figure 
inclusions sitting inside the original medium are decomposed into cu 
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(f). The components of inhomogeneous 


Tfferent components from the NMF. 



(a) 


(b) 


(c) 
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Figure 15: NMF decomposition of the DSM images from EIT in Example 7 , with = { 2 . 3712 , 2 . 3548 , 2 . 2904 } 

Example 8. In this example, we consider the case of 4 inclusions with same size as in Example 7 sitting inside 
the sampling region, which are placed at positions of (0.36, 0.36), (0.36, —0.44), (—0.44, 0.36) and (—0.44, —0.44); 
see Figure 


16 


a). The squared reconstructed images from the indices 1 2 after normalization is shown in Figure 
16 (b). Figure [l 6 ] (c-e) presents the images of a-pj, ( u p )i t <§) (v P )j n for l = 1,2,3 after NMF over the image I. 


The generalized eigenvalues are respectively given as }f =1 = {5.9647,4.2460,3.8970} in this example. The 
squared image of the approximation to I“T’ 7 (/) after normalization is in Figure fl 6 | (f). We can see that we can 
obtain fairly nicely the principal components of the image coming from signals from the inclusions. 



Figure 16: NMF decomposition of the DSM images from EIT in Example 8 , with }f =1 = {5.9647,4.2460,3.8970} 
Example 9. In this example, 2 inclusions of the same size as in Example 7 are introduced in the homogeneous 
background, and they are respectively placed at the positions (—0.36, 0.36) and (0.36, 0.36) inside the domain; 
see Figure 17 (a). The squared reconstructed images from the indices 1 2 after normalization is given in Figure 
17 (b). The images of a il j l ( u p ) i; ® ( v p )j n for l = 1,2,3 after NMF over the image I are shown in Figure 17 


(c^e). The generalized eigenvalues are respectively given as {eq Ui }f =1 = {3.9194,0,0} in this example. Figure 
0 (f) presents the squared image of the approximation to after normalization. From the figures, we 

can see that the principal components coming from the inclusions can be nicely obtained, and both the sizes 
and locations of inhomogeneities can be reasonably obtained from the NMF image after the introduction of a 
appropriate cutoff. 
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Figure 17: NMF decomposition of the DSM images from EIT in Example 9, with }f =1 = {3.9194,0,0} 
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